Extracting Precise Data on the Mathematical Content of PDF Documents
Abstract
As more and more scientific documents become available in PDF format, their automatic analysis becomes increasingly important. We present a procedure that extracts mathematical symbols from PDF documents by examining both the original PDF file and a rasterized version. This provides more precise information than is available either directly from the PDF file or by traditional character recognition techniques. The data can then be used to improve mathematical parsing methods that transform the mathematics into richer formats such as MathML.
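The idea of combining the two views of a page can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the PDF side yields character names with approximate origin points, and the rasterized side is a small binary bitmap whose connected ink regions give exact glyph bounding boxes. The names `glyph_boxes`, `attach_boxes`, `PAGE` and `PDF_CHARS` are hypothetical, and the nearest-corner matching rule is an assumption chosen for brevity.

```python
from collections import deque


def glyph_boxes(bitmap):
    """Return bounding boxes (x0, y0, x1, y1) of 8-connected ink regions.

    `bitmap` is a list of equal-length strings where '#' marks ink.
    Coordinates are pixel indices with y growing downwards.
    """
    height, width = len(bitmap), len(bitmap[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != "#" or seen[y][x]:
                continue
            # Breadth-first flood fill over one connected component.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] == "#"
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def attach_boxes(pdf_chars, boxes):
    """Pair each PDF character (name, x, y) with the closest unused box.

    Assumption: (x, y) is the glyph origin in pixel coordinates, which is
    compared with the bottom-left corner (x0, y1) of each raster box.
    """
    remaining = list(boxes)
    paired = []
    for name, px, py in pdf_chars:
        if not remaining:
            break
        best = min(remaining, key=lambda b: abs(b[0] - px) + abs(b[3] - py))
        remaining.remove(best)
        paired.append((name, best))
    return paired


# Hypothetical toy page: a vertical stroke and a small loop.
PAGE = [
    "..........",
    ".#....##..",
    ".#....#.#.",
    ".#....##..",
    "..........",
]
# Hypothetical characters as they might be read from the PDF content stream.
PDF_CHARS = [("l", 1, 3), ("D", 6, 3)]

symbols = attach_boxes(PDF_CHARS, glyph_boxes(PAGE))
```

Each entry of `symbols` carries the character identity from the PDF together with a pixel-exact box from the image. The sketch treats every connected region as one glyph, so symbols made of several disconnected parts (such as "i" or "=") would need an extra grouping step.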
Similar resources
Extracting Precise Data from PDF Documents for Mathematical Formula Recognition
As more and more scientific documents become available in PDF format, their automatic analysis becomes increasingly important. We present a procedure that extracts mathematical symbols from PDF documents by examining both the original PDF file and a rasterised version. This provides more precise information than is available either directly from the PDF file or by traditional character recognit...
Extracting anchorable information units from PDF files
Document processing and understanding is important for a variety of applications such as office automation, creation of electronic manuals, online documentation and annotation etc. The first step towards this process often involves the extraction of relevant keywords and phrases from the documents so that they can be automatically hyperlinked within and outside the document so that we can creat...
A Preprocessing and Analyzing Method of Images in PDF Documents for Mathematical Expression Retrieval
PDF documents are the important information resources for a mathematical expression retrieval system. As a major component of PDF documents, the image objects must be converted to coded form with the help of character recognition and document analysis technology firstly for content based searching. Therefore, the quality of these images becomes the key factor which decides the correctness in th...
Intelligent Wrapping from PDF Documents
Wrapping is the process of navigating a data source, semiautomatically extracting data and transforming it into a form suitable for data processing applications. The semi-structured form of web pages, coupled with the availability of business-relevant data, has led to the availability of several established products on the market for wrapping data from the Web. One such approach is the Lixto me...
Identification of embedded mathematical formulas in PDF documents using SVM
With the tremendous popularity of PDF format, recognizing mathematical formulas in PDF documents becomes a new and important problem in document analysis field. In this paper, we present a method of embedded mathematical formula identification in PDF documents, based on Support Vector Machine (SVM). The method first segments text lines into words, and then classifies each word into two classes,...